Abstract: With the advent of increasingly complex hardware in real-time embedded
systems (processors with performance-enhancing features such as pipelines, cache
hierarchy, multiple cores), many processors now have a set-associative L2 cache. Thus,
there is a need for considering cache hierarchies when validating the temporal behavior
of real-time systems, in particular when estimating tasks' worst-case execution times
(WCETs). To the best of our knowledge, there is only one approach for WCET
estimation for systems with cache hierarchies [10], which turns out to be unsafe for
set-associative caches.

In this paper, we highlight the conditions under which the approach described
in [10] is unsafe. A safe static instruction cache analysis method is then presented.
Contrary to [10], our method supports set-associative and fully associative caches. The
proposed method is evaluated on medium-size and large programs. We show that
the method is tight most of the time. We further show that in all cases WCET
estimates are much tighter when considering the cache hierarchy than when considering
only the L1 cache. An evaluation of the analysis time is conducted, demonstrating that
analysing the cache hierarchy has a reasonable computation time.

Key-words: WCET, hard real-time systems, memory hierarchy, static analysis,
abstract interpretation.
Worst-case analysis of set-associative instruction cache hierarchies

Summary: With the arrival of complex hardware in embedded real-time systems
(processors with performance-enhancing features such as pipelines,
cache hierarchies, multiple cores), many processors now have
set-associative L2 caches. It therefore becomes necessary to consider cache hierarchies
when validating the temporal behavior of real-time systems, in particular
when estimating an upper bound on the worst-case execution time of the tasks
running on the system. To our knowledge, there is only one
approach dealing with cache hierarchies for the computation of this bound [10],
which turns out to be unsafe for set-associative caches.

In this report, we present the conditions under which the approach described
in [10] is unsafe. A safe static approach is presented for instruction
caches. Unlike [10], our method supports set-associative
caches and fully associative caches. The method is evaluated on
test programs as well as a real application. We show that our method
is precise most of the time, and that the worst-case execution time estimate is always
more precise when considering the cache hierarchy than when considering a single
cache level. An evaluation of the computation time is carried out, showing that the analysis
of the cache hierarchy is performed in a reasonable time.

Keywords: worst-case execution time, hard real-time, memory hierarchy, static
analysis, abstract interpretation.
1 Introduction

Cache memories were introduced to reduce memory access times, compensating for the
increasing gap between fast microprocessors and relatively slower main
memories. Caches are very efficient at reducing average-case memory latencies for
applications with good spatial and temporal locality. Architectures with caches are 
now commonly used in embedded real-time systems due to the increasing demand for 
computing power of many embedded applications. 

In real-time systems it is crucial to prove that the execution of a task meets its 
deadline in all execution situations, including the worst-case. This proof needs an 
estimation of the worst-case execution times (WCETs) of any sequential task in the 
system. WCET estimates have to be safe (larger than or equal to any possible execution 
time). Moreover, they have to be tight (as close as possible to the actual worst-case 
execution time) to correctly dimension the resources required by the system.

The presence of caches in real-time systems makes the estimation of both safe and 
tight WCET bounds difficult due to the dynamic behavior of caches. Safely estimating 
WCET on architectures with caches requires a knowledge of all possible cache contents 
in every execution context, and requires some knowledge of the cache replacement 
policy. 

During the last decade, much research has been undertaken to predict WCETs on
architectures equipped with caches. Regarding instruction caches, static cache analysis
methods have been designed, based on so-called static cache simulation [20, 11] or
abstract interpretation [18, 4]. Approaches for static data cache analysis have also
been proposed [5, 16]. Other approaches like cache locking have been suggested when
the replacement policy is hard to predict precisely ifTSl or for data caches ifTol . The 
impact of multi-tasking has also been considered by approaches aiming at statically 
determining cache related preemption delays ifTSlfTTl . 

To the best of our knowledge, only [10] deals with cache hierarchies. In this work,
static cache analysis is applied to every level of the cache hierarchy. The memory
reference stream considered by the analysis at level L of the cache hierarchy (for example
the L2 cache) is a subset of the memory reference stream considered at level L - 1 (for
example the L1 cache) when the analysis ensures that some references always hit at level
L - 1. However, we show that the way references are filtered out in [10] is unsafe for
set-associative caches. In this paper, we overcome this limitation by proposing
a safe multi-level cache analysis for set-associative caches,
whatever the degree of associativity. Our approach can be applied to caches with
different replacement policies thanks to the reuse of an existing cache analysis method.

The paper presents experimental results showing that in most cases the
analysis is tight. Furthermore, in all cases WCET estimates are much tighter when
considering the cache hierarchy than when considering the L1 cache only. An evaluation
of the analysis time is also presented, demonstrating that analysing the L2 cache has a
reasonable computation time.

The rest of the paper is organized as follows. Related work is surveyed in Section 2.
Section 3 presents a counterexample showing that the approach presented in [10] may
produce underestimated WCET estimates when analysing set-associative caches.
Section 4 then details our proposal. Experimental results are given in Section 5. Finally,
Section 6 concludes with a summary of the contributions of this paper, and gives
directions for future work.
2 Related work

Caches in real-time systems raise timing predictability issues due to their dynamic
behavior and their replacement policy. Many static analysis methods have been proposed
in order to produce a safe WCET estimate on architectures with caches.

To be safe, existing static cache analysis methods determine all possible cache
contents at every point in the execution, considering all execution paths together.
Possible cache contents can be represented as sets of concrete cache states IITSl or by 
a more compact representation called abstract cache states (ACS) [18, 4, 10, 11].

Two main classes of approaches [18, 11] exist for static WCET analysis on
architectures with caches.

In [18] the approach is based on abstract interpretation [2, 3] and uses ACS. An
Update function is defined to represent a memory access to the cache and a Join
function is defined to merge two different ACS in case there is an uncertainty on the path
to be followed at run-time (e.g. at the end of a conditional construct). In this approach,
three different analyses are applied, each using a fixpoint computation to determine: if a
memory block is always present in the cache (Must analysis), if a memory block may
be present in the cache (May analysis), and if a memory block will not be evicted after
it has been first loaded (Persistence analysis). A cache categorisation (e.g. always-hit,
first-miss) can then be assigned to every instruction based on the results of the three
analyses. This approach, originally designed for LRU caches, has been extended to
different cache replacement policies in [6]: Pseudo-LRU, Pseudo-Round-Robin. To our
knowledge, this approach has not been extended to analyze multiple levels of caches.
Our multi-level cache analysis will be defined as an extension of [18], mainly because
of the theoretical results applicable when using abstract interpretation.

In [9, 11], so-called static cache simulation is used to determine every possible
content of the cache before each instruction. Static cache simulation computes abstract
cache states using dataflow analysis. A cache categorisation (always-hit, always-miss,
first-hit and first-miss) is used to classify the worst-case behavior of the cache for a
given instruction. The base approach, initially designed for direct-mapped caches, was
later extended to set-associative caches [20].

The cache analysis method presented in [9] has been extended to cache hierarchies
in [10]. A separate analysis of each memory level is performed by first analysing the
behavior of the L1 cache. The result of the analysis of the L1 cache is then
used as an input to the analysis of the L2 cache, and so on. The approach considers an
access to the next level of the memory hierarchy (e.g. L2 cache) if the access is not
classified as always-hit in the current level (e.g. L1 cache). As shown in Section 3,
this filtering of memory accesses, although it looks correct at first glance, is unsafe
for set-associative caches. Our work is based on the same principles as [10] (cache
analysis for every level of the memory hierarchy, filtering of memory accesses), except
that the unsafe behavior present in [10] is removed. Moreover, our paper presents an
extensive evaluation of the performance of multi-level cache analysis, both in terms of
tightness, and in terms of analysis time.

3 Limitation of Mueller's approach 

The multi-level cache analysis method presented by F. Mueller in [10] performs a
separate analysis for each level in the memory hierarchy. The output of the analysis for
level L is a classification of each memory reference as first-miss, first-hit, always-miss,
or always-hit, and is used as an input for the analysis of level L + 1. In [10],
always-hit means that the reference is guaranteed to be in the cache; always-miss is
used when a reference is not guaranteed to be in the cache (but may be in the cache for
some execution paths); first-hit and first-miss are used for references enclosed in loops,
to distinguish the first execution from the others. All references are considered when
analyzing level L + 1 except those classified as always-hit at level L (or at a previous
level). The implicit assumption behind this filtering of memory accesses is that when it
cannot be guaranteed that a reference is a hit at level L, the worst-case situation occurs
when a cache access to level L + 1 is performed. Unfortunately, this assumption is not
safe as soon as the degree of associativity is greater than or equal to two, as shown by
the counterexample depicted in Figure 1.

Figure 1: Example of limitation for 2-way L1 and L2 caches

The figure represents possible streams of memory references on a system with a
2-way set-associative L1 cache and a 2-way set-associative L2 cache, both with an LRU
replacement policy. The safety problem is observed on reference x, assumed to be
performed inside a function. References a, b, c, and d do not cause any safety problem
(they cause misses in the L1 and L2 both at analysis time and at run-time); they are
introduced only to illustrate the safety problem on reference x. Let us assume that:

— a and c map onto the same set as x in the L1 cache and in the L2 cache.

— b and d map onto the same set as x in the L1 cache and map onto a different set
than x in the L2 cache. This frequent case may occur because the size of the L1
cache is smaller than the size of the L2 cache.

The left part of the figure presents the contents of the abstract cache states at points 
p1, p2, p3 and p4 in the reference stream (only the sets where reference x is mapped are
shown for the sake of conciseness), as well as the resulting classification. In the figure, 
{a, x} means that both a and x may be in the cache line. The right part of the figure 
presents the concrete cache contents at the same points when the worst-case execution 
path (WCEP), which takes the right path in the conditional construct, is followed. 

From the classification of reference x, the analysis outcome is 2 misses in the L1
cache + 2 hits in the L2 cache. In contrast, executing the worst-case reference stream
results in 1 hit in the L1 cache + 1 miss in the L1 cache + 1 miss in the L2 cache.
Assuming an architecture where a miss is the worst case and 2 * ThitL2 < TmissL2,
the contribution to the WCET of the cache accesses to x when executing the code is
larger than the one considered in the analysis, which is not safe. This counterexample
has been coded, in order to check that the counter-intuitive behavior of [10] actually
occurs in practice.

The safety problem found in [10] is due to the combination of several factors: (i)
the reference stream characteristics, (ii) considering uncertain accesses as misses, (iii)
considering an access to the next level in such cases.

To further explain the reasons for the safety problem, let us define the set reuse
distance between two references to the same memory block for a cache level L as
the position in the set (equivalent to its way) of the memory block when the second
reference occurs. If the memory block is not present when the block is referenced for
the second time then the set reuse distance is greater than the number of ways. For
instance, the set reuse distance of x in Figure 1 at point p4 for Mueller's analysis is 3
in the L1 cache (greater than the number of L1 ways) and 2 in the L2 cache (present in
the second way). In contrast, for the possible concrete cache this value is 3 (not present
in the L1 cache) and 3 (not present in the L2 cache). In [10], uncertain accesses are always
propagated to the next cache level and the analysis may underestimate the set reuse
distance. This underestimation then results in more hits in the next level in the analysis
than in a worst-case execution. Our approach fixes the problem by enumerating the
two possible behaviors of every uncertain access (i.e. considering that the access may
occur or not).
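
The notion of set reuse distance can be illustrated with a small concrete simulation. The following is a minimal sketch, assuming a single LRU-managed cache set; the function name `set_reuse_distances`, the convention of reporting ways + 1 for an absent block, and the two example streams are hypothetical and are not the exact streams of Figure 1.

```python
def set_reuse_distances(stream, block, ways):
    """Set reuse distances of `block` in one LRU cache set.

    `stream` lists, in order, the blocks that access the set.
    For every reference to `block` after the first one, the result holds
    its LRU position (1 = most recently used) just before the access,
    or ways + 1 when the block is no longer in the set (assumed convention).
    """
    lru = []            # lru[0] is the most recently used block
    seen = False
    distances = []
    for ref in stream:
        if ref == block:
            if seen:
                distances.append(lru.index(block) + 1 if block in lru else ways + 1)
            seen = True
        if ref in lru:
            lru.remove(ref)
        lru.insert(0, ref)
        del lru[ways:]  # LRU eviction: keep at most `ways` blocks
    return distances


# Hypothetical 2-way L2 set. The middle access to x is uncertain at L2:
# it reaches L2 only if it misses in L1.
propagated = set_reuse_distances(["x", "a", "x", "c", "x"], "x", ways=2)
filtered = set_reuse_distances(["x", "a", "c", "x"], "x", ways=2)
```

When the uncertain access is assumed to reach L2, the last reference to x has distance 2 (a hit); when it is served by L1 instead, the distance is 3 (a miss). Considering only the first case therefore underestimates the distance.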

4 Multi-level set-associative instruction cache WCET 
analysis 

After a brief overview of the structure of our multi-level cache analysis framework
(§ 4.1), we define in this section the classification of memory accesses (§ 4.2), and
detail the analysis and prove its termination (§ 4.3). The use of the cache analysis
outputs for WCET computation is presented in § 4.4.



4.1 Overview 

Our static multi-level set-associative instruction cache analysis is applied to each level
of the cache hierarchy separately. The approach analyses the first cache level (L1
cache) to classify every reference according to its worst-case cache behavior (always-hit,
always-miss, first-hit, first-miss and not classified, see § 4.2). This cache hit/miss
classification (CHMC) is not sufficient to know if an access to a memory block may
occur at the next cache level (L2). Thus, a cache access classification (CAC) (Always,
Never and Uncertain, see § 4.2) is introduced to capture if it can be guaranteed that the
next cache level will be accessed or not.

The combination of the CHMC and the CAC at a given level is used as an input 
of the analysis of the next cache level in the memory hierarchy. Once all the cache 
levels have been analyzed, the cache classification of each level is used to estimate the 
WCET. This framework is illustrated in Figure 2.

Figure 2: Multi-level cache analysis framework

4.2 Cache classification

Cache hit/miss classification

Due to the semantic variation of the cache classification between the static cache
simulation [11] and abstract interpretation [18] approaches, we detail the cache hit/miss
classification (CHMC) used in our analysis, similar to the one used in [18]:

— always-hit (AH): the reference is guaranteed to be in cache, 

— always-miss (AM): the reference is guaranteed not to be in cache, 

— first-hit (FH): the reference is guaranteed to be in cache the first time it is
accessed, but is not guaranteed afterwards,

— first-miss (FM): the reference is not guaranteed to be in cache the first time it is 
accessed, but is guaranteed afterwards, 

— not-classified (NC): the reference is not guaranteed to be in cache and is not 
guaranteed not to be in cache. 

Cache access classification 

In order to know if an access to a memory block may occur at a given cache level, 
we introduce a cache access classification (CAC). It is used as an input of the cache 
analysis of each level to decide if the block has to be considered by the analysis or not. 
The cache access category for a reference r at a cache level L is defined as follows: 

— N (Never): the access to r is never performed at cache level L, 

— A (Always): the access to r is always performed at cache level L, 

— U (Uncertain): it cannot be guaranteed that the access to r is always performed 
or is never performed at level L. 

The cache access classification for a reference r at a cache level L depends on
the results of the cache analysis of the reference r at level L - 1 (cache hit/miss
classification, and cache access classification):

CAC_{r,L} = f(CAC_{r,L-1}, CHMC_{r,L-1})

The CAC for a reference r at level L is N (never) when the cache hit/miss
classification for r at a previous level is always-hit (i.e. it is guaranteed that accessing r will
never require an access to cache level L). On the other hand, the CAC for a reference
r at level L is A for the first level of the cache hierarchy, or when the CHMC and CAC
at level L - 1 are respectively always-miss and A (i.e. it is guaranteed that accessing
r will always require an access to cache level L). The CAC for reference r at level L is
U in all the other cases, expressing the uncertainty that the cache level L is accessed.
As detailed in § 4.3, the cache analysis for U accesses explores the two cases where r
accesses cache level L or not, to identify the worst case.

Table 1 shows all the possible cases of cache access classifications for cache level
L depending on the results of the analysis of level L - 1 (CACs and CHMCs).



CAC_{r,L-1} \ CHMC_{r,L-1}    AH    AM    FH    FM    NC
A                             N     A     U     U     U
N                             N     N     N     N     N
U                             N     U     U     U     U

Table 1: Cache access classification for level L (CAC_{r,L})

The table contents motivate the need for the cache access classification. Indeed, in
the case of an always-miss at level L - 1, determining if a reference r should be considered
at level L requires more knowledge than the CHMC can provide: if r is always
referenced at level L - 1 (CAC_{r,L-1} = A), it should also be considered at level L; similarly,
if it is unsure that r is referenced at level L - 1 (CAC_{r,L-1} = U), the reference is still
unsure at level L.

Figure 3: Join and Update functions for the Must analysis with LRU replacement
(a. Join function of the Must analysis; b. Update function of the Must analysis)

It also has to be noted that in the case of an N access, the cache hit/miss classification
can be disregarded because the value will be ignored during the WCET computation
step for the considered level.
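
The rules defining the cache access classification can be sketched as a small function. This is only an illustration, assuming the classifications are encoded as the strings used in the text ("A", "N", "U" and "AH", "AM", "FH", "FM", "NC"); the function names are hypothetical.

```python
def cache_access_classification(prev_cac, prev_chmc):
    """CAC of a reference at level L, from its CAC and CHMC at level L - 1."""
    # Never accessed at L - 1, or guaranteed to hit there: level L is never accessed.
    if prev_cac == "N" or prev_chmc == "AH":
        return "N"
    # Always accessed at L - 1 and guaranteed to miss there: level L is always accessed.
    if prev_cac == "A" and prev_chmc == "AM":
        return "A"
    # Every other combination leaves the access to level L uncertain.
    return "U"


def cac_per_level(chmcs):
    """CACs for levels 1..len(chmcs) + 1, given the CHMC of each cache level.

    The first level is always accessed ("A"); each following level is
    derived from the previous one.
    """
    cacs = ["A"]
    for chmc in chmcs:
        cacs.append(cache_access_classification(cacs[-1], chmc))
    return cacs
```

For instance, `cac_per_level(["NC", "AM"])` yields `["A", "U", "U"]`: once an access is uncertain at the L2 cache, it stays uncertain at the next level even if the L2 classification is always-miss.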

4.3 Multi-level analysis 

The proposed multi-level analysis is based on a well-known cache analysis method. The
analysis presented in [18] is used, due to the theoretical results of abstract interpretation
[2, 3], and the support for multiple replacement policies [18, 6] (LRU, Pseudo-LRU,
Pseudo-Round-Robin). Nevertheless, our analysis can also be integrated into the static
cache simulation method [11].

The method detailed in [18] is based on three separate fixpoint analyses applied on
the program control flow graph:

— a Must analysis determines if a memory block is always present in the cache at a 
given point: if so, the block CHMC is always-hit; 

— a May analysis determines if a memory block may be in the cache at a given 
point: if not, the block CHMC is always-miss. Otherwise, if not present at this 
point in the Must analysis and in the Persistence analysis the block CHMC is not 
classified; 

— a Persistence analysis determines if a memory block will not be evicted after it 
has been loaded; the CHMC of such blocks is first-miss. 
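
The way the three analyses combine into a classification can be sketched as follows. This is a minimal illustration, assuming each analysis result is reduced to a boolean for the block at the program point of interest; the function name and the order in which the tests are applied are assumptions, and first-hit is not derived here since the list above does not describe it.

```python
def chmc_from_analyses(in_must, in_may, in_persistence):
    """Cache hit/miss classification of a block at one program point.

    in_must:        the block is in the Must ACS (always present)
    in_may:         the block is in the May ACS (may be present)
    in_persistence: the block is never evicted once loaded
    """
    if in_must:
        return "AH"   # always-hit
    if not in_may:
        return "AM"   # always-miss
    if in_persistence:
        return "FM"   # first-miss
    return "NC"       # not classified
```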

Abstract cache states are computed at every basic block. Two functions on the
abstract domain, named Update and Join, are defined for each analysis:

— Function Update is called for every memory reference on an ACS to compute
the new ACS resulting from the memory reference. This function considers both
the cache replacement policy and the semantics of the analysis.

— Function Join is used to merge two different abstract cache states in the case
when a basic block has two predecessors in the control flow graph, for
example at the end of a conditional construct.

Figure 3 gives an example of the Join (3.a) and Update (3.b) functions for the Must
analysis for a 2-way set-associative cache with LRU replacement policy. As in this
context sets are independent from each other, only one set is depicted. A concept of age
is associated with the cache blocks of the same set. The smaller the block age, the more
recent the access to the block. For the Must analysis, a memory block b is stored only
once in the ACS, with its maximum age. It means that its actual age at run-time will
always be lower than or equal to its age in the ACS. The Join and Update functions
are defined as follows for the Must analysis with LRU replacement (see Figure 3):

— The Join function applied to two ACS results in an ACS containing only the
references present in the two input ACS and with their maximal age.

— The Update function performs an access to a memory reference c using an input
abstract cache state ACS_in (the abstract cache state before the memory access)
and produces an output abstract cache state ACS_out (the abstract cache state
after the memory access). The Update function maps c onto its ACS_out set
with the youngest age and increases the age of the other memory blocks present
in the same set in ACS_in. When the age of a memory block is higher than the
number of ways, the memory block is evicted from ACS_out.

For the other analyses (May and Persistence), the approach is similar and the Join 
function is defined as follows: 

— May analysis: union of references present in the ACS and with their minimal
age;

— Persistence analysis: union of references present in the ACS and with their
maximal age.

For more details see [18] and for the other replacement policies see [6].
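
The Update and Join functions described above can be sketched for a single LRU set. This is a minimal illustration, not the implementation of [18]: the dictionary representation (block name to age, 1 being the youngest), the function names, and the refinement that blocks older than the accessed block keep their age are assumptions of this sketch.

```python
def must_update(acs, block, ways):
    """Must Update for one LRU set: access `block` in the abstract state `acs`."""
    old_age = acs.get(block, ways + 1)      # absent blocks count as older than any way
    out = {}
    for other, age in acs.items():
        if other == block:
            continue
        # Only blocks younger than the accessed block get older (assumed refinement).
        new_age = age + 1 if age < old_age else age
        if new_age <= ways:                 # older than the number of ways: evicted
            out[other] = new_age
    out[block] = 1                          # the accessed block becomes the youngest
    return out


def must_join(acs1, acs2):
    """Must Join: blocks present in both states, with their maximal age."""
    return {b: max(acs1[b], acs2[b]) for b in acs1.keys() & acs2.keys()}


def may_join(acs1, acs2):
    """May Join: union of the blocks, with their minimal age."""
    out = dict(acs1)
    for b, age in acs2.items():
        out[b] = min(age, out.get(b, age))
    return out


def persistence_join(acs1, acs2):
    """Persistence Join: union of the blocks, with their maximal age."""
    out = dict(acs1)
    for b, age in acs2.items():
        out[b] = max(age, out.get(b, age))
    return out
```

With a 2-way set, `must_update({"b": 1, "a": 2}, "c", ways=2)` gives `{"b": 2, "c": 1}`: block a exceeds the number of ways and is evicted.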

Extending [18] to multi-level caches does not require any change in the original
analysis framework. Only the base functions have to be modified to take into account
the uncertainty of some references at a given cache level, expressed by the cache
access classifications (CAC). Function Join need not be modified. Function Update
(named hereafter Update_m to distinguish our function from the original one) is defined
as follows, depending on the CAC of the currently analyzed reference r:

• A (Always) access. In the case of an A access the original Update function is
used.

ACS_out = Update(ACS_in, r);  Update_m ⇔ Update

• N (Never) access. In the case of an N access, the analysis does not consider this
access at the current cache level, so the abstract cache state stays unchanged.

ACS_out = ACS_in;  Update_m ⇔ identity

• U (Uncertain) access. In the case of a U access, the analysis deals with the
uncertainty of the access by considering the two possible alternative sub-cases
(see Figure 4 for an illustration):

— the access is performed. The result is then the same as an A access;

— the access is not performed. The result is then the same as an N access.

To obtain the ACS_out produced by a U access, we merge these two different
abstract cache states with the Join function.

ACS_out = Join(Update(ACS_in, r), ACS_in)
Update_m(ACS_in, r) = Join(Update(ACS_in, r), ACS_in)

Figure 4: Update_m function for U access

The original functions Join and Update produce a safe hit/miss classification of
the memory references. In our case, this validity is kept for the A accesses and is
obvious for the N accesses. As for the U accesses, which are the key to ensuring safety,
Update_m has to preserve the semantics of each analysis. For the Must and Persistence
analyses, the Update_m function maintains the maximal age of each memory reference
by applying the original Join function to the two ACS (access occurs or not). Similarly,
for the May analysis, the minimal age is kept by the Update_m function. So the semantics
of each analysis is maintained by the Update_m function.
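
The three cases of Update_m can be sketched as a wrapper around any Update/Join pair. This is a minimal illustration under stated assumptions: the names `update_m`, `must_update` and `must_join` are hypothetical, the abstract cache state is a dictionary mapping blocks to ages for one LRU set, and the Must functions included here are a simplified stand-in for those of [18].

```python
from functools import partial


def must_update(acs, block, ways):
    """Simplified Must Update for one LRU set (ages 1..ways, 1 = youngest)."""
    old_age = acs.get(block, ways + 1)
    aged = {b: (a + 1 if a < old_age else a) for b, a in acs.items() if b != block}
    out = {b: a for b, a in aged.items() if a <= ways}   # drop evicted blocks
    out[block] = 1
    return out


def must_join(acs1, acs2):
    """Must Join: blocks present in both states, with their maximal age."""
    return {b: max(acs1[b], acs2[b]) for b in acs1.keys() & acs2.keys()}


def update_m(acs, block, cac, update, join):
    """Multi-level Update: dispatch on the cache access classification."""
    if cac == "A":                      # always accessed: original Update
        return update(acs, block)
    if cac == "N":                      # never accessed: state unchanged
        return dict(acs)
    if cac == "U":                      # uncertain: join both alternatives
        return join(update(acs, block), acs)
    raise ValueError(f"unknown cache access classification: {cac!r}")


# Hypothetical 2-way set in which x is guaranteed present with age 1.
update_2way = partial(must_update, ways=2)
after_one = update_m({"x": 1}, "a", "U", update_2way, must_join)
after_two = update_m(after_one, "b", "U", update_2way, must_join)
```

After one uncertain access, x is still guaranteed to be present but with age 2, and the uncertain block a is not guaranteed to be cached. After a second uncertain access the Must state is empty, so x can no longer be classified always-hit.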

4.3.1 Termination of the analysis 

It is demonstrated in [18] that the domain of abstract cache states is finite and,
moreover, that the Join and Update functions are monotonic. So, using ascending chains
(every ascending chain is finite) proves the termination of the fixpoint computation.

In our case, the only modification to [18] is the Update function. Thus, to prove
the termination of our analysis we have to prove that the modified function Update_m
is monotonic for each type of cache access.

Proof: for an A access, Update_m is identical to Update, so it is monotonic. For
an N access, Update_m is the identity function, so it is monotonic. Finally, for a U
access, Update_m is a composition of Update and Join. As the composition of
monotonic functions is monotonic, Update_m is then also monotonic. This guarantees the
termination of our analysis for each type of cache access and thus for the whole
analysis. □

It is important to note that our analysis terminates for any monotonic Update/Join
functions. Thus, all Update/Join functions defined in [18, 6] to model different
replacement policies can be directly reused.
4.4 WCET computation

The result of the multi-level analysis gives the worst-case access time of each memory
reference to the memory hierarchy. In other words, this analysis produces the
contribution to the WCET of each memory reference, which can be included in well-known
WCET computation methods [15, 14].

In the formulae given below, the contribution to the WCET of an NC reference at
level L is the latency of an access to level L + 1, which is safe for architectures without
timing anomalies caused by interactions between caches and pipelines, as defined in
[8]. For architectures with such timing anomalies (e.g. architectures with out-of-order
pipelines), more complex methods such as [7] have to be used to cope with the complex
interactions between caches and pipelines.



Name        Description                                                  Code size (bytes)
matmult     Multiplication of two 50x50 integer matrices                 1200
ns          Search in a multi-dimensional array                          600
bs          Binary search for the array of 15 integer elements           336
minver      Inversion of floating point 3x3 matrix                       4408
jfdctint    Integer implementation of the forward DCT (Discrete Cosine   3040
            Transform)
adpcm       Adaptive pulse code modulation algorithm                     7740
task1       Confidential                                                 12711
task2       Confidential                                                 12395

Table 2: Benchmark characteristics

We define the following notations: the constant Thit_ℓ represents the cost in cycles
of a hit at level ℓ (accesses to the main memory are always hits); first and next
distinguish the first and the successive executions in loops; the binary variables
present_first_ℓ(r) and present_next_ℓ(r) indicate whether an access to reference r
occurs (1) or not (0) at level ℓ. Finally, variables COST_first(r) and COST_next(r)
give the contribution to the WCET of a reference r at a given point in the program, which
can be used to compute the WCET. COST_first(r) and COST_next(r) are
computed as follows:

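\[
COST\_first(r) = \sum_{\ell=1}^{n} Thit_\ell \times present\_first_\ell(r)
\]

\[
COST\_next(r) = \sum_{\ell=1}^{n} Thit_\ell \times present\_next_\ell(r)
\]

present_first_ℓ(r) and present_next_ℓ(r) are computed as follows:

\[
present\_first_\ell(r) =
\begin{cases}
1 & \text{if } \ell = 1 \\
1 & \text{if } present\_first_{\ell-1}(r) = 1 \wedge (CHMC_{r,\ell-1} = AM \vee CHMC_{r,\ell-1} = FM \vee CHMC_{r,\ell-1} = NC) \\
0 & \text{otherwise}
\end{cases}
\]

\[
present\_next_\ell(r) =
\begin{cases}
1 & \text{if } \ell = 1 \\
1 & \text{if } present\_next_{\ell-1}(r) = 1 \wedge (CHMC_{r,\ell-1} = AM \vee CHMC_{r,\ell-1} = FH \vee CHMC_{r,\ell-1} = NC) \\
0 & \text{otherwise}
\end{cases}
\]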
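
The cost formulae can be sketched as follows. This is a minimal illustration, assuming the CHMCs are given as strings for each cache level and that the last entry of the latency list is the main memory; the function names and the latencies of 1, 10 and 100 cycles are hypothetical values chosen for the example.

```python
MISS_ON_FIRST = {"AM", "FM", "NC"}   # CHMCs that propagate the first access
MISS_ON_NEXT = {"AM", "FH", "NC"}    # CHMCs that propagate the following accesses


def presence(chmcs, propagating):
    """present_l for l = 1..n, where chmcs[l - 1] is the CHMC at cache level l
    and level n = len(chmcs) + 1 is the main memory."""
    present = [1]                                    # level 1 is always accessed
    for chmc in chmcs:
        reached = present[-1] == 1 and chmc in propagating
        present.append(1 if reached else 0)
    return present


def cost(chmcs, thit, propagating):
    """Sum of Thit_l * present_l over all levels, main memory included."""
    if len(thit) != len(chmcs) + 1:
        raise ValueError("thit needs one latency per cache level plus main memory")
    return sum(t * p for t, p in zip(thit, presence(chmcs, propagating)))


# Hypothetical latencies in cycles: L1 hit, L2 hit, main memory access.
THIT = [1, 10, 100]
cost_first = cost(["FM", "AH"], THIT, MISS_ON_FIRST)   # first-miss in L1, always-hit in L2
cost_next = cost(["FM", "AH"], THIT, MISS_ON_NEXT)
```

In this example the first execution costs 11 cycles (L1 access plus L2 hit) and the following ones cost 1 cycle (L1 hit).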



5 Experimental results 

In this section, we evaluate the tightness of our static multi-level cache analysis
compared to the execution in a worst-case scenario. We also evaluate the extra
computation time caused by the analysis of the cache hierarchy. We first describe the
experimental conditions and then we give and analyze experimental results.

5.1 Experimental setup 

Cache analysis and WCET estimation. The experiments were conducted on MIPS
R2000/R3000 binary code compiled with gcc 4.1 with flag -O0. The WCETs of tasks
are computed by the Heptane¹ timing analyzer [1], more precisely its Implicit Path
Enumeration Technique (IPET). The fixpoint analysis is an implementation of the
abstract interpretation approach initially proposed in [18]. The Must, May and Persistence
analyses are conducted sequentially on a two-level cache hierarchy (L1 and L2 caches),
both caches implementing an LRU replacement policy. The analysis is context-sensitive
(functions are analyzed in each different calling context).

To separate the effect of the caches from those of the other parts of the processor
micro-architecture, WCET estimation only takes into account the contribution of caches to
the WCET as presented in Section 4.4. The effects of other architectural features are
not considered. In particular, we do not take into account timing anomalies caused by
interactions between caches and pipelines, as defined in [8]. The cache classification
not-classified is thus assumed to have the same worst-case behavior as always-miss
during the WCET computation in our experiments.

Computation times are measured on an Intel Pentium 4 3.6 GHz with
2 GB of RAM.

Measurement environment. The measurement of cache activity in a worst-case
execution scenario uses the Nachos educational operating system² running on top of a
simulated MIPS processor. We have extended Nachos with a two-level cache hierarchy
with an LRU replacement policy at both levels.

Benchmarks. The experiments were conducted on five small benchmarks and two
tasks from a larger real application (see Table 2 for the application characteristics). All
small benchmarks are maintained by the Mälardalen WCET research group³.
The real tasks are part of the case study provided by the automotive industrial partner
of the Mascotte ANR project⁴ to the project partners.

¹Heptane is an open-source static WCET analysis tool available at
http://www.irisa.fr/aces/software/software.html.

²Nachos web site, http://www.cs.washington.edu/homes/tom/nachos/
³http://www.mrtc.mdh.se/projects/wcet/benchmarks.html
⁴http://www.projet-mascotte.org/

5.2 Results

Precision of the multi-level analysis. In order to determine the tightness of the
multi-level analysis, static analysis results are compared with those obtained by
executing the programs in their worst-case scenario. Because it is difficult to identify the
input data that results in the worst-case situation in complex programs, we only use the
simplest benchmarks (matmult, ns, bs, minver, jfdctint) to evaluate the precision of the
analysis.

Small L1 and L2 instruction caches are used in this part of the performance
evaluation so that the code of most of the benchmarks (except ns and bs) does not fit into
the caches. The L1 cache is a 1KB, 4-way set-associative cache with 32B lines. We use two
different L2 cache configurations of 2KB, 8-way set-associative: one with 64B lines and
another one with 32B lines.
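
For reference, the number of sets in each of these configurations follows from the relation size = sets x ways x line size. The snippet below is a minimal sketch of that arithmetic; the helper name `num_sets` is hypothetical and is not part of Heptane.

```python
def num_sets(size_bytes: int, ways: int, line_bytes: int) -> int:
    """Number of sets of a set-associative cache: size / (ways * line size)."""
    lines, remainder = divmod(size_bytes, line_bytes)
    if remainder or lines % ways:
        raise ValueError("cache size must be a multiple of ways * line size")
    return lines // ways


# Configurations used for the small benchmarks (sizes taken from the text).
L1_SETS = num_sets(1024, ways=4, line_bytes=32)      # 1KB, 4-way, 32B lines
L2_SETS_64B = num_sets(2048, ways=8, line_bytes=64)  # 2KB, 8-way, 64B lines
L2_SETS_32B = num_sets(2048, ways=8, line_bytes=32)  # 2KB, 8-way, 32B lines
```

With so few sets at each level, the code of most benchmarks necessarily maps several lines to every set, which is what exercises the replacement policy in these experiments.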

To evaluate the precision of our approach, the comparison of the hit ratio at the L2
level between static analysis and measurement is not appropriate. Indeed, the inherent
pessimism of the static cache analysis at the L1 level introduces some accesses at the
L2 level that never happen at run-time. Instead, the results are given in Table 3 using
two classes of metrics:

- The number of references and the number of misses at every level of the
memory hierarchy in the worst-case execution scenario (top three lines), to show the
behavior of the multi-level cache analysis.

- The contribution of the memory accesses to the WCET (bottom two lines) when
considering a cache hierarchy (L1+L2) and when ignoring the L2 cache (L1
only), to demonstrate the usefulness of multi-level analysis. To compute it, we
use an L1 hit cost of 1 cycle, an L2 hit cost of 10 cycles and a memory latency of
100 cycles. When considering only one cache level, the memory latency is 110
cycles.
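
The figures reported in Table 3 are consistent with the following reading of these costs, which is our interpretation rather than a formula quoted from the analysis: every reference pays the L1 latency, every L1 miss additionally pays the L2 latency, and every L2 miss additionally pays the memory latency. The function names below are hypothetical.

```python
L1_HIT_CYCLES = 1    # cost of an L1 access
L2_HIT_CYCLES = 10   # extra cost of an L2 access (on an L1 miss)
MEMORY_CYCLES = 100  # extra cost of a memory access (on an L2 miss)


def contribution_l1_l2(l1_accesses: int, l1_misses: int, l2_misses: int) -> int:
    """Cache contribution to the WCET, in cycles, with both cache levels."""
    return (l1_accesses * L1_HIT_CYCLES
            + l1_misses * L2_HIT_CYCLES
            + l2_misses * MEMORY_CYCLES)


def contribution_l1_only(l1_accesses: int, l1_misses: int) -> int:
    """Cache contribution to the WCET, in cycles, when the L2 is ignored.

    Each L1 miss then pays the single-level memory latency of 110 cycles.
    """
    return l1_accesses * L1_HIT_CYCLES + l1_misses * (L2_HIT_CYCLES + MEMORY_CYCLES)


# jfdctint, static analysis, 32B L1 lines and 64B L2 lines (values from Table 3).
JFDCTINT_L1_L2 = contribution_l1_l2(8039, 725, 54)
JFDCTINT_L1_ONLY = contribution_l1_only(8039, 725)
```

Under this reading, the jfdctint counts give 20689 cycles with the hierarchy and 87789 cycles with the L1 only, matching the corresponding entries of Table 3.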

Two types of behavior can be observed:

- In the first type of situation, the number of L1 misses computed statically
is very close to the measured value (benchmark jfdctint). In this benchmark, the
base cache analysis applied to the L1 cache is very tight. As a consequence, the
reference stream considered during the analysis of the L2 cache is very close to
the accesses actually performed at run-time. Thus, the number of L2 misses
computed statically is also very close to the number of L2 misses occurring during
execution. In this case, the overall difference between static analysis and execution
is mainly due to the pessimism introduced by considering the cache hierarchy
(classification as U of every access that cannot be guaranteed either to be or not
to be in the L1).

- The second type of situation occurs when the static cache analysis at the L1 level
is slightly less tight. This pessimism then carries over to the L2 level, where it
is increased by the introduction of the U accesses. In this case, the multi-level
analysis is still tight enough. Moreover, it turns out that many accesses not
detected as hits by the L1 analysis can be detected as hits by the L2 analysis.
The resulting WCET is thus much smaller than if only one level of cache were
considered.
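
How the classification of a reference at one level decides whether it is considered at the next level can be sketched as follows. This is a minimal illustration, not Heptane code: the names `CHMC`, `CAC` and `next_level_cac` are hypothetical, and the rule encoded here (A after an always-miss, N after an always-hit, U in every other case, including first-miss and not-classified) is our reading of the classification, stated as an assumption.

```python
from enum import Enum


class CHMC(Enum):
    """Cache hit/miss classification of a reference at one cache level."""
    ALWAYS_HIT = "AH"
    ALWAYS_MISS = "AM"
    FIRST_MISS = "FM"
    NOT_CLASSIFIED = "NC"


class CAC(Enum):
    """Cache access classification: does the reference reach this level?"""
    A = "always"
    N = "never"
    U = "uncertain"


def next_level_cac(cac_here: CAC, chmc_here: CHMC) -> CAC:
    """Classify the access of a reference to the next cache level (assumed rule)."""
    # A reference that never reaches this level, or always hits here,
    # is never forwarded to the next level.
    if cac_here is CAC.N or chmc_here is CHMC.ALWAYS_HIT:
        return CAC.N
    # A reference that always reaches this level and always misses here
    # is always forwarded.
    if cac_here is CAC.A and chmc_here is CHMC.ALWAYS_MISS:
        return CAC.A
    # Nothing can be guaranteed in the remaining cases.
    return CAC.U
```

Every reference reaches the L1, so the chain starts with `CAC.A`; any imprecision of the L1 analysis then shows up as U accesses in the L2 analysis, which is the source of the extra pessimism discussed above.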

Benchmark Metrics                      Static analysis  Measurement      Static analysis  Measurement
                                       32B-64B lines    32B-64B lines    32B-32B lines    32B-32B lines
--------- ---------------------------- ---------------- ---------------- ---------------- ----------------
jfdctint  nb of L1 accesses            8039             8039             8039             8039
          nb of L1 misses              725              723              725              723
          nb of L2 misses              54               49               101              96
          cache contribution to WCET:
            L1+L2, cycles              20689            20169            25389            24869
            L1 only, cycles            87789            -                87789            -

bs        nb of L1 accesses            196              196              196              196
          nb of L1 misses              16               11               16               11
          nb of L2 misses              15               6                16               11
          cache contribution to WCET:
            L1+L2, cycles              1856             906              1956             1406
            L1 only, cycles            1956             -                1956             -

minver    nb of L1 accesses            4146             4146             4146             4146
          nb of L1 misses              150              140              150              140
          nb of L2 misses              108              71               150              140
          cache contribution to WCET:
            L1+L2, cycles              16446            12646            20646            19546
            L1 only, cycles            20646            -                20646            -

ns        nb of L1 accesses            26428            26411            26428            26411
          nb of L1 misses              23               13               23               13
          nb of L2 misses              20               7                23               13
          cache contribution to WCET:
            L1+L2, cycles              28658            27241            28958            27841
            L1 only, cycles            28958            -                28958            -

matmult   nb of L1 accesses            525894           525894           525894           525894
          nb of L1 misses              51               41               51               41
          nb of L2 misses              49               19               51               38
          cache contribution to WCET:
            L1+L2, cycles              531304           528204           531504           530104
            L1 only, cycles            531504           -                531504           -

Benchmark Metrics                      Static analysis  Static analysis
                                       32B-64B lines    32B-32B lines
--------- ---------------------------- ---------------- ----------------
adpcm     nb of L1 accesses            187312           187312
          nb of L1 misses              2891             2891
          nb of L2 misses              289              297
          cache contribution to WCET:
            L1+L2, cycles              245122           245922
            L1 only, cycles            505322           505322

task1     nb of L1 accesses            1872522          1872522
          nb of L1 misses              678              678
          nb of L2 misses              662              678
          cache contribution to WCET:
            L1+L2, cycles              1945502          1947102
            L1 only, cycles            1947102          1947102

task2     nb of L1 accesses            6783             6493
          nb of L1 misses              792              796
          nb of L2 misses              718              796
          cache contribution to WCET:
            L1+L2, cycles              86503            94053
            L1 only, cycles            93903            94053

Table 3: Precision of the static multi-level n-way analysis (4-way L1 cache, 8-way
L2 cache; cache sizes of 1KB/2KB in the top table, 8KB/64KB in the bottom table).

Figure 5: Computation time with a 64KB and a 128KB L2 cache

For the largest codes (adpcm, task1, task2), only the results of static cache analysis are
given (no measurements were made because of the difficulty of executing these tasks in
their worst-case execution scenario). Since the code size of these three tasks is larger than
that of the simple benchmarks, the caches are now larger and more realistic than those
considered before. We use an 8KB L1 cache and a 64KB L2 cache with the
same cache line sizes and associativity as before.

We can notice the rather low number of cache hits in the L2 when the L2 cache has
32B lines. This is explained by the size of the loops in the applications compared to the L1
cache size. In all tasks but adpcm, the code of the loops fits entirely into the L1 cache,
and thus there is no reuse once a piece of code gets loaded into the L2 cache. When
the cache line size in the L2 cache is larger, the number of hits increases significantly,
due to the spatial locality of applications.
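
The effect of the L2 line size can be illustrated on the simplest possible case. The sketch below assumes straight-line code that starts on a line boundary, is executed once with cold caches, and is never evicted; the function name `cold_misses` is hypothetical and the model is deliberately much cruder than the static analysis.

```python
def cold_misses(code_bytes: int, l1_line: int, l2_line: int) -> tuple[int, int, int]:
    """Return (L1 misses, L2 misses, L2 hits) for one pass over sequential code.

    Assumes aligned code, cold caches, no evictions, and l2_line >= l1_line.
    """
    l1_misses = -(-code_bytes // l1_line)  # ceiling division: lines touched in L1
    l2_misses = -(-code_bytes // l2_line)  # lines touched in L2
    l2_hits = l1_misses - l2_misses        # L1 misses served by an already loaded L2 line
    return l1_misses, l2_misses, l2_hits


SAME_LINE_SIZE = cold_misses(4096, l1_line=32, l2_line=32)
LARGER_L2_LINE = cold_misses(4096, l1_line=32, l2_line=64)
```

With equal line sizes every L1 miss is also an L2 miss, so the L2 brings nothing when there is no reuse; with 64B L2 lines, each L2 line fill serves two consecutive L1 line fills, so in this idealized case half of the L1 misses hit in the L2.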

In summary, the overall tightness of the multi-level cache analysis is strongly
dependent on the initial cache analysis of [18]. In all cases: (i) the extra pessimism
caused by our multi-level analysis for the sake of safety (introduction of U accesses)
is reasonable, (ii) considering the cache hierarchy generally results in much lower
WCETs compared with considering only one cache level and an access to main
memory for each miss.
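
One simple way to quantify this pessimism from Table 3 is the ratio between the statically computed and the measured cache contribution to the WCET. The sketch below only re-uses values copied from the top part of Table 3 (32B L1 lines, 64B L2 lines); the ratio itself and the names used are our own illustration, not a metric defined by the analysis.

```python
# (static, measured) L1+L2 cache contribution to the WCET, in cycles,
# copied from the top part of Table 3 (32B L1 lines, 64B L2 lines).
L1_L2_CYCLES = {
    "jfdctint": (20689, 20169),
    "bs": (1856, 906),
    "minver": (16446, 12646),
    "ns": (28658, 27241),
    "matmult": (531304, 528204),
}


def overestimation(static_cycles: int, measured_cycles: int) -> float:
    """Ratio static / measured; a safe analysis never yields a value below 1."""
    return static_cycles / measured_cycles


RATIOS = {name: overestimation(s, m) for name, (s, m) in L1_L2_CYCLES.items()}

for name, ratio in sorted(RATIOS.items(), key=lambda item: item[1]):
    print(f"{name:10s} {ratio:.3f}")
```

All ratios are at least 1, as expected from a safe analysis. The smallest ratios are obtained for matmult and jfdctint, and the largest for bs, whose absolute contribution is also by far the smallest.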



Computation time evaluation. The analysis time is evaluated on a two-level cache
hierarchy, using the three largest codes (adpcm, task1, and task2) and the same cache
structures as before. What we wish to evaluate is the extra cost of analysing the
second level of cache compared with a traditional cache analysis of only one level.
The extra analysis time mainly depends on the number of references considered when
analysing the L2 cache, which itself depends on the size of the L1 cache (the larger the
L1, the higher the number of references detected as hits in the L1 and thus the lower
the number of references considered in the analysis of the L2). Thus, we vary the size
of the L1 (4-way, with cache lines of 32B) from 1KB to the L2 cache size.

Figure 5 details the results for 64 KB (32B and 64B lines) and 128 KB (32B and
64B lines) L2 caches respectively. The X axis gives the L1 cache size in KB. The Y axis
reports the computation time in seconds.

The shapes of the curves are very similar for each benchmark and each L2
cache size tested. The computation time for analysing the L1 cache increases with its
size because of the inherent dependency of single-level cache analysis on the cache size.
However, the increase in computation time is not always monotonic, as for instance for
benchmark adpcm. This non-monotonic behavior comes from a variation in the number
of iterations of the fixpoint computation in the single-level cache analysis. In
contrast, the analysis time of the L2 cache decreases when the L1 cache size increases:
as the L1 cache filters more and more memory references, the number of accesses to
the L2 cache considered in the analysis is reduced (more and more accesses become
N accesses).

The proposed multi-level cache analysis introduces an extra computation cost for U
accesses, to explore the two possible behaviors of uncertain accesses. It can be observed
that this extra cost is not visible because it is masked by the filtering of accesses.
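
The extra work for U accesses can be sketched on the Must analysis of a single LRU cache set. The code below is a minimal illustration under stated assumptions, not the Heptane implementation: an abstract cache state maps each block to an upper bound on its age, `must_update` and `must_join` follow the classical Must analysis for LRU, and a U access is handled by joining the state obtained if the access occurs with the unchanged state obtained if it does not. All function names are hypothetical.

```python
def must_update(acs: dict[str, int], block: str, ways: int) -> dict[str, int]:
    """Must-analysis update of one LRU set when `block` is accessed."""
    old_age = acs.get(block, ways)  # a block not guaranteed present acts as age == ways
    new_acs = {}
    for other, age in acs.items():
        if other == block:
            continue
        # Only blocks younger than the accessed block grow older.
        aged = age + 1 if age < old_age else age
        if aged < ways:             # blocks reaching age == ways may be evicted
            new_acs[other] = aged
    new_acs[block] = 0              # the accessed block becomes the youngest
    return new_acs


def must_join(a: dict[str, int], b: dict[str, int]) -> dict[str, int]:
    """Must-analysis join: keep blocks present in both states, with their maximal age."""
    return {blk: max(a[blk], b[blk]) for blk in a.keys() & b.keys()}


def transfer(acs: dict[str, int], block: str, cac: str, ways: int) -> dict[str, int]:
    """Abstract effect of a reference classified 'A', 'N' or 'U' at this level."""
    if cac == "A":                  # the access always occurs
        return must_update(acs, block, ways)
    if cac == "N":                  # the access never occurs: state unchanged
        return dict(acs)
    # 'U': both behaviors are explored, then merged safely.
    return must_join(must_update(acs, block, ways), acs)


STATE = {"a": 0, "b": 1}            # 2-way set: a is the youngest, b the oldest
AFTER_A = transfer(STATE, "c", "A", ways=2)
AFTER_U = transfer(STATE, "c", "U", ways=2)
```

In this example an A access to c guarantees a (age 1) and c (age 0), whereas a U access only guarantees a with age 1: c may not have been loaded, and b may have been evicted. The U case costs one update plus one join instead of a single update, which is the extra cost referred to above.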

When the L2 cache size is 128 KB, the slope of the L2 curve is lower than for a 64
KB cache. This is due to the incompressible time needed for the single-level cache analysis
of the L2 cache, which depends on the L2 cache size and masks the filtering effect of the
L1 cache. Nevertheless, even in this case the computation time is reasonable.

To conclude, the computation time required for the multi-level set-associative cache
(L1 + L2) analysis is significant but stays reasonable on the case study application.

5.3 Discussion 

The safety issue of [10] is hard to detect on existing codes because of (i) the pessimism
introduced by the cache analysis at the first cache level, which masks the WCET
underestimation caused by the safety issue, and (ii) the difficulty of executing tasks in their
worst-case conditions. We have implemented the counterexample presented in Section 3,
which demonstrates that this phenomenon occurs in practice.

The experiments were undertaken with an LRU replacement policy at each level of
the cache hierarchy. Nevertheless, the modification of the Update function is done at
a high level and is independent of any cache replacement policy.

Finally, experiments were conducted by considering two levels of caches. We did
not present experiments with an L3 cache due to the difficulty of finding large enough
publicly available codes. Nevertheless, our method allows the analysis of a cache
hierarchy with more than two levels.

6 Conclusion 

In this paper we have shown that the previous method to analyze multi-level caches
for real-time systems [10] is unsafe for set-associative caches. We have proposed a
solution to produce safe WCET estimations for set-associative cache hierarchies,
whatever the degree of associativity and the cache replacement policy. We have proven the
termination of the fixpoint analysis, and the experimental results show that this method
is precise in many cases, generally tighter than considering only one cache level, and
has a reasonable computation time on the case study. In future research we will
consider unified caches, for instance by using partitioning techniques to separate instructions
from data, and we will extend this approach to analyze cache hierarchies of multicore
architectures.